The bioinformatics analysis pipeline nfcore/ampliseq is used for amplicon sequencing, supporting denoising of any amplicon and supports a variety of taxonomic databases for taxonomic assignment including 16S, ITS, CO1 and 18S.
Pipeline input was saved in folder input.
Sequencing data was provided in the samplesheet file
Samplesheet.tsv that is displayed below:
Metadata associated with the sequencing data was provided in
Metadata.tsv and is displayed below:
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. The sequence quality was checked using FastQC and resulting data was aggregated using the FastQC module of MultiQC. For more quality controls and per sample quality checks you can check the full MultiQC report, which can be found in multiqc/multiqc_report.html.
Cutadapt is trimming primer sequences from sequencing reads. Primer sequences are non-biological sequences that often introduce point mutations that do not reflect sample sequences. This is especially true for degenerated PCR primer. If primer trimming were to be omitted, artifactual amplicon sequence variants might be computed by the denoising tool or sequences might be lost due to being labelled as PCR chimera. Primers were trimmed using cutadapt and all untrimmed sequences were discarded. Sequences that did not contain primer sequences were considered artifacts. Less than 27.5% of the sequences were discarded per sample and a mean of 86.3% of the sequences per sample passed the filtering. Cutadapt results can be found in folder cutadapt.
Additional quality filtering can improve sequence recovery. Often it is advised trimming the last few nucleotides to avoid less well-controlled errors that can arise there. Each read was truncated at the first instance of a quality score less than or equal to 2. Reads were trimmed to a specific length and the length cutoff was automatically determined by the median quality of all input reads. Reads were trimmed before median quality drops below 25 and at least 75% of reads are retained, resulting in a trim of forward reads at 230 bp and reverse reads at 229 bp, reads shorter than this were discarded. Reads with more than 2 expected errors were discarded. Read counts passing the filter are shown in section ‘Read counts per sample’ column ‘filtered’.
Quality profiles:
Forward (left) and reverse (right) read quality stats for incoming data:
Forward (left) and reverse (right) read quality stats for preprocessed data:
Overall read quality profiles are displayed as heat map of the
frequency of each quality score at each base position. The mean quality
score at each position is shown by the green line, and the quartiles of
the quality score distribution by the orange lines. The red line shows
the scaled proportion of reads that extend to at least that position.
Original plots can be found in folder dada2/QC/ with names that end in
_qual_stats.pdf.
DADA2 performs fast and accurate sample inference from amplicon data with single-nucleotide resolution. It infers exact amplicon sequence variants (ASVs) from amplicon data with fewer false positives than many other methods while maintaining high sensitivity.
DADA2 reduces sequence errors and dereplicates sequences by quality filtering, denoising, read pair merging (for paired end Illumina reads only) and PCR chimera removal.
Read error correction was performed using estimated error rates, visualized below. Error rates for forward reads are at the left side and reverse reads are at the right side.
Estimated error rates are displayed for each possible transition. The
black line shows the estimated error rates after convergence of the
machine-learning algorithm. The red line shows the error rates expected
under the nominal definition of the Q-score. The estimated error rates
(black line) should be a good fit to the observed rates (points), and
the error rates should drop with increased quality. Original plots can
be found in folder dada2/QC/ with names that
end in .err.pdf.
Tracking read numbers through DADA2 processing steps for each sample. The following table shows the read numbers after each processing stage. Processing stages are: input - read pairs into DADA2, filtered - read pairs passed quality filtering, denoisedF - forward reads after denoising, denoisedR - reverse reads after denoising, merged - successfully merged read pairs, nonchim - read pairs in non-chimeric sequences (final ASVs).
Samples with unusual low reads numbers relative to the number of expected ASVs should be treated cautiously, because the abundance estimate will be very granular and might vary strongly between (theoretical) replicates due to high impact of stochasticity.
Following, the numbers of the table above are shown in stacked barcharts as percentage of DADA2 input reads. Stacked barchart of read pair numbers (denoisedF & denoisedR halfed, because each pair is split) per sample and processing stage:
Between 55.34% and 59.02% reads per sample (average 57%) were retained for analysis within DADA2 steps.
The proportion of lost reads per processing stage and sample should not be too high, totalling typically <50%. Samples that are very different in lost reads (per stage) to the majority of samples must be compared with caution, because an unusual problem (e.g. during nucleotide extraction, library preparation, or sequencing) could have occurred that might add bias to the analysis.
Finally, 347 amplicon sequence variants (ASVs) were obtained across
all samples. The ASVs can be found in dada2/ASV_seqs.fasta. And the
corresponding quantification of the ASVs across samples is in dada2/ASV_table.tsv. An extensive
table containing both was saved as dada2/DADA2_table.tsv. ASVs were
inferred for each sample independently.
VSEARCH clustered 347 ASVs into 312 centroids with pairwise identity of 0.97. Clustered ASV sequences and abundances can be found in folder vsearch_cluster.
Barrnap classifies the ASVs into the origin domain (including mitochondrial origin).
Barrnap classified 248 ( 79.49 %) ASVs as most similar to Bacteria,
64 ( 20.51 %) ASVs to Archea, 0 ( 0 %) ASVs to Mitochondria, 0 ( 0 %)
ASVs to Eukaryotes, and 0 ( 0 %) were below similarity threshold to any
kingdom.
rRNA classification results can be found in folder barrnap.
ASVs were filtered for bac (bac: Bacteria,
arc: Archaea, mito: Mitochondria,
euk: Eukaryotes) using the above classification. The number
of ASVs was reduced by 64 (20.51%), from 312 to 248 ASVs.
The following table shows read counts for each sample before and after filtering:
In average 21.55 % reads were removed, but at most 35.23 % reads per sample.
A length filter was used to reduce potential contamination. Before filtering, ASVs had the following length profile (count of 1 was transformed to 1.5 to allow plotting on log10 scale):
Filtering omitted all ASVs with length above 255 bp.
The following table shows read counts for each sample before and after filtering:
In average 0.62 % reads were removed, but at most 1.76 % reads per sample.
The number of ASVs was reduced by 4 ( 1.61 %), from 248 to 244 ASVs.
Length filter results can be found in folder asv_length_filter.
The taxonomic classification was performed by DADA2 using the
database:
GTDB - Genome Taxonomy Database - Release R07-RS207. More
details about the reference taxonomy database can be found in the ‘Methods section’. A minimum bootstrap confidence of
50 (out of 100 trials) was used for assigning a taxonomic level.
The taxonomic reference database was cut by primer sequences to improve matching. The original database had 40,660 sequences with 52,197,725 bp, retained were 17,772 (43.7%) sequences that represented 4,575,891 bp (8.8%).
DADA2 classified 100 % ASVs at Kingdom level, 89.75 % ASVs at Phylum level, 84.02 % ASVs at Class level, 71.31 % ASVs at Order level, 63.52 % ASVs at Family level, 53.69 % ASVs at Genus level, 46.31 % ASVs at Species level, 2.87 % ASVs at Species_exact level.
DADA2 taxonomy assignments can be found in folder dada2 in files ASV_tax_*.tsv.
The taxonomic classification was performed by QIIME2
using the database:
Greengenes 16S - Version 13_8 - clustered at 85% similarity - for testing purposes only.
More details about the reference taxonomy database can be found in the
‘Methods section’.
QIIME2 classified 100 % ASVs at Kingdom level, 87.3 % ASVs at Phylum level, 70.08 % ASVs at Class level, 30.33 % ASVs at Order level, 18.03 % ASVs at Family level, 6.97 % ASVs at Genus level, 0.41 % ASVs at Species level.
QIIME2 taxonomy assignments can be found in folder qiime2/taxonomy.
Files that were input to QIIME2 can be found in folder qiime2/input/. Results of taxonomic classification of DADA2 was used in all following analysis, see in the above sections.
Unwanted taxa are often off-targets generated in PCR with primers
that are not perfectly specific for the target DNA. For 16S rRNA
sequencing mitrochondria and chloroplast sequences are typically removed
because these are frequent unwanted non-bacteria PCR products. ASVs were
removed when the taxonomic string contained any of
mitochondria,chloroplast (comma separated), had fewer than
10 total read counts over all samples, or that were present in fewer
than 2 samples. Consequently, 244 ASVs were reduced by 204 ( 83.61 %) to
40 . The following table shows read counts for each sample before and
after filtering:
Tables with read count numbers and filtered abundance tables are in folder qiime2/abundance_tables.
The abundance tables are the final data for further downstream analysis and visualisations. The tables are based on the computed ASVs and taxonomic classification, but after removal of unwanted taxa. Folder qiime2/abundance_tables contains tap-separated files (.tsv) that can be opened by any spreadsheet software.
Absolute abundance tables produced by the previous steps contain count data, but the compositional nature of 16S rRNA amplicon sequencing requires sequencing depth normalisation. This step computes relative abundance tables using TSS (Total Sum Scaling normalisation) for various taxonomic levels and detailed tables for all ASVs with taxonomic classification, sequence and relative abundance for each sample. Typically used for in depth investigation of taxa abundances. Folder qiime2/rel_abundance_tables contains tap-separated files (.tsv) that can be opened by any spreadsheet software.
Interactive abundance plot that aids exploratory browsing the discovered taxa and their abundance in samples and allows sorting for associated meta data. Folder qiime2/barplot contains barplots, click qiime2/barplot/index.html to open it in your web browser.
Additionally, barplots with average relative abundance values were
produced for treatment1,badpairwise10 (comma separated if
several) in qiime2/barplot_average in separate
folders following the scheme barplot_{treatment}:
Produces rarefaction plots for several alpha diversity indices, and is primarily used to determine if the richness of the samples has been fully observed or sequenced. If the slope of the curves does not level out and the lines do not become horizontal, this might be because the sequencing depth was too low to observe all diversity or that sequencing error artificially increases sequence diversity and causes false discoveries.
Folder qiime2/alpha-rarefaction contains the data, click qiime2/alpha-rarefaction/index.html to open it in your web browser.
Diversity measures summarize important sample features (alpha diversity) or differences between samples (beta diversity). Diversity calculations are based on sub-sampled data rarefied to 543 counts.
Alpha diversity measures the species diversity within samples. Please note that ASVs were inferred for each sample independently, that can make alpha diversity indices a poor estimate of true diversity. This step calculates alpha diversity using various methods and performs pairwise comparisons of groups of samples. It is based on a phylogenetic tree of all ASV sequences. Folder qiime2/diversity/alpha_diversity contains the alpha-diversity data:
Alpha diversity is considered not trustworthy when it correlates positively with sequencing depth. Spearman’s rank correlation was calculated for total counts per sample after all filtering steps (in folder qiime2/abundance_tables) with alpha diversity measures. No significant positive correlation was found between alpha diversity and sample counts:
Scatter plots with linear regression line (blue) with 95% confidence interval (gray shaded area).
Beta diversity measures the species community differences between samples. This step calculates beta diversity distances using various methods and performs pairwise comparisons of groups of samples. Additionally, principle coordinates analysis (PCoA) plots are produced that can be visualized with Emperor in your default browser without the need for installation. These calculations are based on a phylogenetic tree of all ASV sequences. Folder qiime2/diversity/beta_diversity contains the beta-diverity data:
Statistics on differences between specific metadata groups that can
be found in folder qiime2/diversity/beta_diversity/.
Each significance test result is in its separate folder following the
scheme {method}_distance_matrix-{treatment}:
qiime2/diversity/beta_diversity/bray_curtis_distance_matrix-mix8/index.html
qiime2/diversity/beta_diversity/bray_curtis_distance_matrix-treatment1/index.html
qiime2/diversity/beta_diversity/jaccard_distance_matrix-mix8/index.html
qiime2/diversity/beta_diversity/jaccard_distance_matrix-treatment1/index.html
qiime2/diversity/beta_diversity/unweighted_unifrac_distance_matrix-mix8/index.html
qiime2/diversity/beta_diversity/unweighted_unifrac_distance_matrix-treatment1/index.html
qiime2/diversity/beta_diversity/weighted_unifrac_distance_matrix-mix8/index.html
qiime2/diversity/beta_diversity/weighted_unifrac_distance_matrix-treatment1/index.html
Permutational multivariate analysis of variance using distance
matrices adonis
(in VEGAN)
determines whether groups of samples are significantly different from
one another. The formula was treatment1,mix8 (multiple
formulas are comma separated). adonis computes an R2 value (effect size)
which shows the percentage of variation explained by a condition, as
well as a p-value to determine the statistical significance. The
sequence of conditions in the formula matters, the variance of factors
is removed (statistically controlled for) from beginning to end of the
formula.
Test results are in separate folders following the scheme
{method}_distance_matrix-{adonis formula}:
qiime2/diversity/beta_diversity/adonis/bray_curtis_distance_matrix-mix8/index.html
qiime2/diversity/beta_diversity/adonis/bray_curtis_distance_matrix-treatment1/index.html
qiime2/diversity/beta_diversity/adonis/jaccard_distance_matrix-mix8/index.html
qiime2/diversity/beta_diversity/adonis/jaccard_distance_matrix-treatment1/index.html
qiime2/diversity/beta_diversity/adonis/unweighted_unifrac_distance_matrix-mix8/index.html
qiime2/diversity/beta_diversity/adonis/unweighted_unifrac_distance_matrix-treatment1/index.html
qiime2/diversity/beta_diversity/adonis/weighted_unifrac_distance_matrix-mix8/index.html
qiime2/diversity/beta_diversity/adonis/weighted_unifrac_distance_matrix-treatment1/index.html
Analysis of
Composition of Microbiomes with Bias Correction (ANCOM-BC) is
applied to identify features that are differentially abundant across
sample groups. Comparisons between groups of samples is performed for
specific metadata that can be found in folder qiime2/ancombc/ and qiime2/ancombc_formula/. Test
results are in separate folders following the scheme
Category-{treatment}-{taxonomic level}:
qiime2/ancombc/da_barplot/Category-badpairwise10-ASV/index.html
qiime2/ancombc/da_barplot/Category-badpairwise10-level-2/index.html
qiime2/ancombc/da_barplot/Category-badpairwise10-level-3/index.html
qiime2/ancombc/da_barplot/Category-badpairwise10-level-4/index.html
qiime2/ancombc/da_barplot/Category-treatment1-ASV/index.html
qiime2/ancombc/da_barplot/Category-treatment1-level-2/index.html
qiime2/ancombc/da_barplot/Category-treatment1-level-3/index.html
qiime2/ancombc/da_barplot/Category-treatment1-level-4/index.html
qiime2/ancombc_formula/da_barplot/Category-treatment1-ASV/index.html
qiime2/ancombc_formula/da_barplot/Category-treatment1-level-2/index.html
qiime2/ancombc_formula/da_barplot/Category-treatment1-level-3/index.html
qiime2/ancombc_formula/da_barplot/Category-treatment1-level-4/index.html
The Swedish Biodiversity
Infrastructure (SBDI) provides a cost-effective, cutting-edge
infrastructure that supports Swedish and international biodiversity and
ecosystems research. Files in preparation for submission to SBDI can be
found in folder SBDI. Tables are generated from
the DADA2 denoising and taxonomy assignment steps. Each table, except
annotation.tsv, corresponds to one tab in the SBDI submission
template. Most of the fields in the template will not be populated,
but if you run nf-core/ampliseq with a sample metadata table
(--metadata) any fields corresponding to a field in the
template will be used.
Microbiome data can be analysed and visualized with certain R packages. For convenience, R objects in RDS format are provided.
Phyloseq objects and can be loaded directely into R with package ‘phyloseq’. The objects contain an ASV abundance table and a taxonomy table. If available, metadata and phylogenetic tree will also be included in the phyloseq object. The files can be found in folder phyloseq.
TreeSummarizedExperiment (TreeSE, TSE) objects can be loaded into R with package ‘TreeSummarizedExperiment’. and contain an ASV abundance table, a taxonomy table, and sequences. If available, metadata and phylogenetic tree will also be included in the object. The files can be found in folder treesummarizedexperiment.
Data was processed using nf-core/ampliseq version 2.15.0 (doi: 10.5281/zenodo.1493841) (Straub et al., 2020) of the nf-core collection of workflows (Ewels et al., 2020), utilising reproducible software environments from the Bioconda (Grüning et al., 2018) and Biocontainers (da Veiga Leprevost et al., 2017) projects.
Data quality was evaluated with FastQC (Andrews, 2010) and summarized with MultiQC (Ewels et al., 2016). Cutadapt (Marcel et al., 2011) trimmed primers and all untrimmed sequences were discarded. Sequences that did not contain primer sequences were considered artifacts. Less than 27.5% of the sequences were discarded per sample and a mean of 86.3% of the sequences per sample passed the filtering. Adapter and primer-free sequences were processed sample-wise (independent) with DADA2 (Callahan et al., 2016) to eliminate PhiX contamination, truncate reads at the first instance of a quality score less than or equal to 2, trim reads (before median quality drops below 25 and at least 75% of reads are retained; forward reads at 230 bp and reverse reads at 229 bp, reads shorter than this were discarded), discard reads with > 2 expected errors, correct errors, merge read pairs, and remove polymerase chain reaction (PCR) chimeras; ultimately, 347 amplicon sequencing variants (ASVs) were obtained across all samples. Between 55.34% and 59.02% reads per sample (average 57%) were retained. The ASV count table contained in total 4921 counts, at least 1030 and at most 1387 per sample (average 1230).
VSEARCH (Rognes et al.,
2016) clustered 347 ASVs into 312 centroids with pairwise identity
of 0.97. Barrnap (Seemann,
2013) filtered ASVs for bac (bac:
Bacteria, arc: Archaea, mito: Mitochondria,
euk: Eukaryotes), 64 ASVs were removed with less than
35.23% counts per sample (248 ASVs passed). 4 ASVs with length above 255
bp were removed with less than 1.76% counts per sample (244 ASVs
passed).
Taxonomic classification was performed by DADA2 (with a minimum
bootstrap confidence of 50 out of 100 trials for assigning a taxonomic
level) and the database ‘GTDB - Genome Taxonomy Database - Release
R07-RS207’
(Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil PA, Hugenholtz P. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol. 2018 Nov;36(10):996-1004. doi: 10.1038/nbt.4229. Epub 2018 Aug 27. PMID: 30148503.)
that had 17,772 (43.7%) sequences extracted by PCR primers to improve
assignments, QIIME2 and the database ‘Greengenes 16S - Version 13_8 -
clustered at 85% similarity - for testing purposes only’
(McDonald, D., Price, M., Goodrich, J. et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6, 610–618 (2012). https://doi.org/10.1038/ismej.2011.139).
ASV sequences, abundance and DADA2 taxonomic assignments were loaded into QIIME2 (Bolyen et al., 2019). Of 244 ASVs, 204 were removed because the taxonomic string contained any of (mitochondria,chloroplast), had fewer than 10 total read counts over all samples, were present in fewer than 2 samples (40 ASVs passed). Within QIIME2, the final microbial community data was visualized in a barplot, evaluated for sufficient sequencing depth with alpha rarefaction curves, investigated for alpha (within-sample) and beta (between-sample) diversity after rarefaction to 543 counts, used to find differential abundant taxa with ANCOM-BC (Lin and Peddada, 2020).
WARNING This methods section is lacking software versions, these can be found in MultiQC’s report section Software Versions or in folder pipeline_info file
software_versions.yml.
Taxonomic classification by DADA2:
database:
GTDB - Genome Taxonomy Database - Release R07-RS207
files:
[https://data.gtdb.ecogenomic.org/releases/release207/207.0/genomic_files_reps/bac120_ssu_reps_r207.tar.gz, https://data.gtdb.ecogenomic.org/releases/release207/207.0/genomic_files_reps/ar53_ssu_reps_r207.tar.gz]
citation:
Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil PA, Hugenholtz P. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol. 2018 Nov;36(10):996-1004. doi: 10.1038/nbt.4229. Epub 2018 Aug 27. PMID: 30148503.
Taxonomic classification by QIIME2:
database:
Greengenes 16S - Version 13_8 - clustered at 85% similarity - for testing purposes only
files:
[https://data.qiime2.org/2023.7/tutorials/training-feature-classifiers/85_otus.fasta, https://data.qiime2.org/2023.7/tutorials/training-feature-classifiers/85_otu_taxonomy.txt]
citation:
McDonald, D., Price, M., Goodrich, J. et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6, 610–618 (2012). https://doi.org/10.1038/ismej.2011.139
MultiQC summarized computational methods in multiqc/multiqc_report.html. The proposed short methods description can be found in MultiQC’s Methods Description, versions of software collected at runtime in MultiQC’s Software Versions, and a summary of non-default parameter in MultiQC’s Workflow Summary.
Technical information to the pipeline run are collected in folder pipeline_info, including software versions
collected at runtime in file software_versions.yml (can be
viewed with a text editor), all parameter settings in file
params_{date}_{time}.json (can be viewed with a text
editor), execution report in file
execution_report_{date}_{time}.html, execution trace in
file execution_trace_{date}_{time}.txt, execution timeline
in file execution_timelime_{date}_{time}.html, and pipeline
direct acyclic graph (DAG) in file
pipeline_dag_{date}_{time}.html.
This report (file summary_report.html) is located in
folder summary_report of the original pipeline results
folder. In this file, all links to files and folders are relative,
therefore hyperlinks will only work when the report is at its original
place in the pipeline results folder. Plots specifically produced for
this report (if any) can be also found in folder summary_report.
A comprehensive read count report throughout the pipeline can be
found in the base results folder in file
overall_summary.tsv.
Please cite the pipeline publication and any software tools used by the pipeline (see citations) when you use any of the pipeline results in your study.